This is the main analysis for my project, in which I look to explain what makes a song popular on Spotify. This document covers:
The Variables - Audio Features
Here are the audio features we will mostly be focusing on. How do these help explain the popularity of a song?
Acousticness - detects the presence of acoustic instruments
Danceability - based on rhythm stability and beat strength
Energy - measure of intensity and activity
Instrumentalness - the higher the score, the less vocals the track contains
Liveness - detects the presence of an audience or if the track was recorded live
Loudness - how loud the track is
Speechiness - detects the presence of spoken word, giving rap music a higher score than opera
Valence - how positive the track sounds; the higher the score, the happier the feel of the track
This is an example of the difference-in-means hypothesis testing carried out on each audio feature. Please see hyp_testing.Rmd for the full hypothesis testing.
Two-sample independent tests
H0: The mean danceability in the 1960s is the same as the mean danceability in the 2010s
Ha: The mean danceability in the 1960s is less than the mean danceability in the 2010s
Equivalently, in terms of the difference in means: H0: danceability2010s - danceability1960s = 0; Ha: danceability2010s - danceability1960s > 0
With a p-value < 0.001 we can reject our null hypothesis in favour of the alternative. So we can say with confidence that the difference in mean danceability between the 1960s and the 2010s is statistically significant.
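As a minimal sketch of this kind of test (with made-up numbers standing in for the real dataset in hyp_testing.Rmd), a one-sided two-sample comparison can be run in R with t.test():

```r
# Illustrative danceability scores for two decades (simulated, not the real data)
set.seed(1)
dance_1960s <- rnorm(100, mean = 45, sd = 10)
dance_2010s <- rnorm(100, mean = 60, sd = 10)

# One-sided Welch two-sample t-test:
# H0: mean(1960s) = mean(2010s), Ha: mean(1960s) < mean(2010s)
test <- t.test(dance_1960s, dance_2010s, alternative = "less")
test$p.value  # reject H0 when this falls below the significance level
```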
This was the case for all of the audio features.
We can see a gradual rise in Danceability and Loudness, with a massive drop in Acousticness.
Near the bottom we can see a rise in Speechiness from the 1980s, possibly due to the rise in popularity of rap music. Instrumentalness decreases over time, showing that people are listening to significantly less instrumental music than they were in the 1960s.
To help me answer my question of what makes a song “popular” on Spotify, I decided to build an explanatory linear regression model. This type of analysis is used to determine the strength of the relationship between a response variable and multiple explanatory variables.
popularity = b0 + b1x1 + b2x2 + b3x3 + … + bnxn
While building this model I used an 80/20 train-test split, meaning I built the model on 80% of the data and then tested the outcome on the remaining 20%.
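A minimal sketch of that split in base R, assuming the full dataset sits in one data frame (the spotify data frame and its columns here are illustrative, not the real build in linear_model_build.Rmd):

```r
set.seed(42)
# Illustrative data frame standing in for the Spotify dataset
spotify <- data.frame(popularity = runif(1000, 0, 100),
                      year = sample(1960:2020, 1000, replace = TRUE))

# Randomly assign 80% of rows to the training set; the rest form the test set
train_idx <- sample(nrow(spotify), size = 0.8 * nrow(spotify))
train_lm  <- spotify[train_idx, ]
test_lm   <- spotify[-train_idx, ]
```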
To start this process I plotted popularity against each of my possible explanatory variables to find the strongest correlations.
For full linear model build see linear_model_build.Rmd
model_1a <- lm(popularity ~ year,
data = train_lm)
summary(model_1a)
##
## Call:
## lm(formula = popularity ~ year, data = train_lm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.158 -7.444 -1.631 5.832 54.369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1223.657496 3.764367 -325.1 <0.0000000000000002 ***
## year 0.636047 0.001892 336.2 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.36 on 97323 degrees of freedom
## Multiple R-squared: 0.5374, Adjusted R-squared: 0.5374
## F-statistic: 1.131e+05 on 1 and 97323 DF, p-value: < 0.00000000000000022
After running my model I look at three factors:
The P-value - is this variable making a significant difference? If the P-value is below the significance level of 0.05 then we can reject the null hypothesis and conclude that the correlation between the variables is significant
The R^2 - is a measure that indicates how much of the variation of popularity is explained by the year
The adjusted R^2 - compensates for the addition of variables. Since we’re building an explanatory model, we don’t want this to drop much lower than the R^2.
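All three of these can be read straight off the fitted model object; a sketch on a toy model (simulated data, not the Spotify dataset):

```r
set.seed(3)
# Toy data with a genuine linear relationship
toy <- data.frame(x = rnorm(200))
toy$y <- 2 * toy$x + rnorm(200)
fit <- lm(y ~ x, data = toy)

s <- summary(fit)
s$coefficients[, "Pr(>|t|)"]  # p-value for each term
s$r.squared                   # multiple R^2
s$adj.r.squared               # adjusted R^2
```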
I used anova (Analysis of Variance) tests to check that the difference between my new model and previous models was significant.
anova(model_3a, model_2c)
popularity ~ year + danceability + loudness + liveness + explicit
model_6a <- lm(popularity ~ year + danceability + loudness + liveness + explicit,
data = train_lm)
summary(model_6a)
##
## Call:
## lm(formula = popularity ~ year + danceability + loudness + liveness +
## explicit, data = train_lm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.595 -7.392 -1.589 5.839 54.268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1173.381097 4.170239 -281.370 <0.0000000000000002 ***
## year 0.607019 0.002155 281.715 <0.0000000000000002 ***
## danceability 0.028390 0.002044 13.886 <0.0000000000000002 ***
## loudness 0.085082 0.004765 17.855 <0.0000000000000002 ***
## liveness -0.036995 0.001859 -19.903 <0.0000000000000002 ***
## explicit 1.064796 0.118767 8.965 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.3 on 97319 degrees of freedom
## Multiple R-squared: 0.5431, Adjusted R-squared: 0.543
## F-statistic: 2.313e+04 on 5 and 97319 DF, p-value: < 0.00000000000000022
So here we have the final model. I stopped adding variables when the adjusted R^2 started dropping and the multiple R^2 was barely going up.
All of our P-values are significant and we have a multiple R^2 of 0.54, with an adjusted R^2 also of 0.54. This means that around 54% of the variance in popularity is explained by our explanatory variables.
This suggests that the model has moderate explanatory power: a little over half of the variation in popularity is accounted for by the regression. However, it also tells us that a large portion of the variability remains unexplained and may be attributed to factors not included in the model.
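The hold-out side of the 80/20 split described earlier can be sketched like this: score the model on the unseen 20% and compare the root mean squared error with the residual standard error from training (data simulated for illustration; the real check lives in linear_model_build.Rmd):

```r
set.seed(9)
# Simulated stand-in for the dataset: popularity driven mainly by year
n <- 1000
dat <- data.frame(year = sample(1960:2020, n, replace = TRUE))
dat$popularity <- 0.6 * dat$year - 1170 + rnorm(n, sd = 10)
train_lm <- dat[1:800, ]
test_lm  <- dat[801:1000, ]

fit <- lm(popularity ~ year, data = train_lm)

# RMSE on unseen data; it should sit close to the training residual standard error
preds <- predict(fit, newdata = test_lm)
rmse  <- sqrt(mean((test_lm$popularity - preds)^2))
rmse
```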
Logistic regression is a statistical method for predicting or explaining a binary outcome, such as yes or no, based on prior observations of a data set.
Rather than using the popularity score, I used a variable I created called is_popular. This splits the data into a logical type: TRUE if the song has a popularity score of 50 or above, and FALSE if below 50.
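Creating that flag is a one-line transformation; a sketch with a few made-up scores:

```r
# Illustrative popularity scores (not the real data)
spotify <- data.frame(popularity = c(12, 50, 73, 49))

# TRUE if the popularity score is 50 or above, FALSE otherwise
spotify$is_popular <- spotify$popularity >= 50
spotify$is_popular  # FALSE TRUE TRUE FALSE
```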
This was built in a similar way to the linear model: I look for correlations and add them to my model one at a time, checking that they are significant. The main difference is that this time I’m looking for a high AUC score, rather than the multiple R^2 I was looking for in the linear model.
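Assuming the AUC was computed with the pROC package (whose print format matches the “Area under the curve” line in the model output), the check can be sketched on simulated data as:

```r
library(pROC)

set.seed(5)
# Simulated stand-in data: popularity flag driven partly by loudness
n <- 500
toy <- data.frame(loudness = rnorm(n, -10, 5))
toy$is_popular <- rbinom(n, 1, plogis(0.3 * toy$loudness + 3))

fit <- glm(is_popular ~ loudness, family = "binomial", data = toy)

# Predicted probabilities, then AUC (the real build scores the held-out test split)
probs   <- predict(fit, type = "response")
auc_val <- auc(roc(toy$is_popular, probs))
auc_val
```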
For full logistic model build see logistic_model_build.Rmd
I decided not to include the year or decade the song was released in this model. It had such a large influence on the linear model that I thought it would be more interesting to see how the logistic model fared without it. Also, if we’re building a model to assist with the writing of a present-day “popular” song, then variables such as year and decade are of no help.
is_popular ~ loudness + explicit + danceability + no_of_artists
model_4_final <- glm(is_popular ~ loudness + explicit + danceability + no_of_artists,
family = "binomial",
data = train_log_mod)
summary(model_4_final)
##
## Call:
## glm(formula = is_popular ~ loudness + explicit + danceability +
## no_of_artists, family = "binomial", data = train_log_mod)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3163 -0.8544 -0.6608 1.1311 3.7310
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.8298932 0.0998028 -78.45 <0.0000000000000002 ***
## loudness 0.0787007 0.0012278 64.10 <0.0000000000000002 ***
## explicit 1.0275451 0.0238377 43.11 <0.0000000000000002 ***
## danceability 0.0098224 0.0004547 21.60 <0.0000000000000002 ***
## no_of_artists 0.1895800 0.0115352 16.43 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 121019 on 97324 degrees of freedom
## Residual deviance: 110098 on 97320 degrees of freedom
## AIC: 110108
##
## Number of Fisher Scoring iterations: 4
## Area under the curve: 0.7219
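One way to read those logistic coefficients is as odds ratios: exponentiating a coefficient gives the multiplicative change in the odds of being popular per unit increase in that variable, holding the others constant. For example, using the explicit estimate from the summary above:

```r
# Coefficient for `explicit` from the model summary above
b_explicit <- 1.0275451

# An explicit track has roughly 2.79 times the odds of being popular,
# holding the other variables constant
exp(b_explicit)  # ≈ 2.79
```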
Here is a genre analysis showing the proportional change of the Top 5 Genres from 1960 to 2020.
I kept the genres and number of followers out of my model building as I was missing almost 50% of the data for them.
Here we see a massive decline in the proportion of folk and soul songs from the 1960s and 70s to the 2010s.
Rock rises from the 60s, spikes in the 1980s, then drops slowly towards the 2010s.
Rap and Pop rise gradually over the decades, with Pop just overtaking Rap. Combined, they make up around 40% of the songs released in the 2010s.
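The proportions behind a chart like that can be sketched with base R table operations (the genre and decade values here are made up for illustration):

```r
# Illustrative genre/decade data standing in for the real dataset
songs <- data.frame(
  decade = rep(c("1960s", "2010s"), each = 5),
  genre  = c("folk", "soul", "rock", "pop", "rap",
             "pop",  "rap",  "rock", "pop", "rap")
)

# Proportion of songs in each genre, per decade (rows sum to 1)
props <- prop.table(table(songs$decade, songs$genre), margin = 1)
props
```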
The best model is my logistic model. It gave me a reasonable AUC score of 0.72, and I feel it works well as an explanatory model.
From my Genre analysis I discovered that the most common genres of the last decade are Pop & Rap
There is no escaping how heavily weighted the popularity score is towards new music. I plan to recreate this project using only new music, which I feel will give me better insight into what gives a song a high popularity score.